When analyzing a dataset, some data points can exert undue influence on models such as linear regression. These data points are called outliers. Outliers can distort summaries of the data and degrade model performance.
What are outliers?
In data science, outliers are values within a dataset that differ greatly from the others: they are either much larger or much smaller. Outliers can appear in a dataset because of measurement variability, data errors, experimental error, and so on. When they are included in the training data, outliers can cause machine learning models to make inaccurate predictions, so they should be examined and handled before training a model.
One of the best ways to understand outliers is the box plot.
Box plots are very useful for seeing the distribution of a variable/feature and detecting outliers in it. A box plot is a graphical representation that describes the behavior of the data in the middle as well as at both ends of the distribution. It shows the data based on the five-number summary: the minimum, the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum.
Interquartile Range:
The difference between the upper quartile and the lower quartile (Q3 - Q1) is called the interquartile range, or IQR.
Box plots help us find the outliers in the data by using the IQR. As a rule, values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are regarded as outliers. The image below will help us better understand the outliers in our data.
In the image above, the points that lie outside the whisker lines are the outliers.
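The 1.5*IQR rule can also be checked numerically. The following is a minimal sketch using only Python's standard library; the function name iqr_outliers and the prices list are hypothetical, made up for illustration, and are not taken from the dataset used later.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k*IQR rule."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Anything outside the two fences is flagged as an outlier.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical nightly prices, made up for illustration only.
prices = [40, 55, 70, 80, 89, 90, 115, 149, 150, 225, 5000]
low, high, flagged = iqr_outliers(prices)
print(low, high, flagged)
```

Different quartile conventions (for example, method="exclusive") give slightly different fences, so a plotting library's whiskers may not match these numbers exactly.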
There are different techniques to handle outliers in a dataset. In our example, we will use the concept of clipping (winsorizing).
What is winsorizing/clipping?
Clipping a dataset means capping its values at the last permitted extreme value, e.g. the 5th or 95th percentile. For example, when we clip the data at the 95th percentile, every value above the 95th percentile is set to the 95th percentile value.
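As a rough sketch of this idea, the snippet below clips a list at a lower and an upper percentile using only Python's standard library. The helper name winsorize, its parameters, and the sample values are assumptions made up for illustration.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles (1 to 99)."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Values below `low` are raised to it; values above `high` are lowered to it.
    return [min(max(v, low), high) for v in values]

# Hypothetical values, made up for illustration only.
data = [1, 20, 21, 22, 23, 24, 25, 26, 27, 28, 500]
clipped = winsorize(data, lower_pct=10, upper_pct=90)
print(clipped)
```

Here the extreme values 1 and 500 are pulled in to the 10th and 90th percentile values, while every other value is left unchanged. On a pandas column, a similar result can be obtained with Series.clip, using bounds computed by Series.quantile.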
The following data set has several (bolded) extremes:
After clipping/winsorizing the top and bottom 10% of the data (setting those values to the nearest permitted extreme), we get:
To illustrate the clipping method, let's work through an example.
We have a dataset named nyc_airbnb.csv, which contains data about the per-night price of Airbnb rentals. The price variable in this dataset contains some outliers. Our task is to find the outliers and cap them by winsorizing/clipping.
First, we load our dataset into a dataframe and view it. The data is stored in a csv file named nyc_airbnb.csv.
To read the csv data, we need to:
- Import the pandas library as pd
- Read the file into nyc using the read_csv method in pandas
- Display nyc

import pandas as pd
nyc = pd.read_csv("../datasets/nyc_airbnb.csv")
nyc
|  | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9 |
| 48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36 |
| 48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27 |
| 48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2 |
| 48894 | 36487245 | Trendy duplex in the very heart of Hell's Kitchen | 68119814 | Christophe | Manhattan | Hell's Kitchen | 40.76404 | -73.98933 | Private room | 90 | 7 | 0 | NaN | NaN | 1 | 23 |
48895 rows × 16 columns
price data
Since we want to find the outliers in the rental price, an effective approach is visualization. To see where the outliers lie in the price data, a strip plot is a very useful graph, because it shows how the datapoints are spread.
Our strip plot visualizes every datapoint of the price data, with the spread of price shown on the y axis. For this plot, we import the plotly express library.
The steps are:
- Import the plotly.express library as px
- From px, call the strip() method to generate the strip plot
- Pass the following arguments to strip():
  - nyc: variable where the data is stored
  - price: column data to plot in the y axis
- Store the result in price_strip, the variable that will save the plot
- Display price_strip using the show() method

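import plotly.express as px
import plotly.io as pio
# Choose renderers that display the figure inside the notebook
pio.renderers.default = "plotly_mimetype+notebook"
# Strip plot with the price of every listing on the y axis
price_strip = px.strip(nyc, y='price')
price_strip.show()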